A novel hybrid feature selection and ensemble-based machine learning approach for botnet detection

In the age of sophisticated cyber threats, botnet detection remains a crucial yet complex security challenge. Existing detection systems are continually outmaneuvered by the relentless advancement of botnet strategies, necessitating a more dynamic and proactive approach. Our research introduces a ground-breaking solution to the persistent botnet problem through a strategic amalgamation of Hybrid Feature Selection methods—Categorical Analysis, Mutual Information, and Principal Component Analysis—and a robust ensemble of machine learning techniques. We uniquely combine these feature selection tools to refine the input space, enhancing the detection capabilities of the ensemble learners. Extra Trees, as the ensemble technique of choice, exhibits exemplary performance, culminating in a near-perfect 99.99% accuracy rate in botnet classification across varied datasets. Our model not only surpasses previous benchmarks but also demonstrates exceptional adaptability to new botnet phenomena, ensuring persistent accuracy in a landscape of evolving threats. Detailed comparative analyses manifest our model's superiority, consistently achieving over 99% True Positive Rates and an unprecedented False Positive Rate close to 0.00%, thereby setting a new precedent for reliability in botnet detection. This research signifies a transformative step in cybersecurity, offering unprecedented precision and resilience against botnet infiltrations, and providing an indispensable blueprint for the development of next-generation security frameworks.

training.Moreover, this training and testing time is comparatively higher than other ensemble techniques 44 .Srinivasan et al. 13 enhanced cyber security efforts by deploying an ensemble classification-based machine learning technique for botnet detection, achieving a notable accuracy of 94%.However, this approach was marked by a relatively high False Positive Rate (FPR) and a lower-than-desired True Positive Rate (TPR).Additionally, the implementation of their stacking ensemble method was time-intensive during the training and testing phases, posing constraints on efficiency.
The suggested botnet detection approach represents a considerable improvement over the single-classifierbased models' limitations.The model exhibits higher performance across numerous publically available datasets by integrating hybrid feature selection strategies with an additional trees ensemble classifier.The model can take advantage of the combined intelligence of numerous decision trees by using an additional ensemble classifier for trees.By using an ensemble approach, the model is better able to handle the complex and varied patterns found in botnet traffic, which boosts its detection capabilities.The suggested model exhibits robustness across diverse publically available datasets, in contrast to standard single classifier-based models that are limited to particular datasets.Its capacity for generalization and adaptability to new, undiscovered botnet kinds is highlighted by its ability to perform well on various datasets.The model outperforms single-classifier-based techniques and shows a significant improvement in accuracy.Additionally, it achieves a greater True Positive Rate (TPR), demonstrating its capacity to correctly identify instances of botnets, while retaining a lower False Positive Rate (FPR), minimizing the number of occasions where regular traffic is mistakenly classified as botnet activity.In fact, the proposed model's increased performance metrics make it a very effective and strong tool for combating botnet threats across a variety of domains, such as the Internet of Things, Android, Industry 4.0, computer systems, and network environments.

Proposed model developing
The model development in machine learning enables intelligent systems to make accurate predictions, optimize processes, uncover hidden patterns, provide personalized experiences, handle large-scale data, and improve efficiency.Below is a detailed description of the proposed model.The development pipeline of the proposed model is given in Fig. 1.

Dataset
The dataset plays a pivotal role in evaluating machine learning models and is a cornerstone of the entire model development process.A well-chosen dataset serves as the foundation for training, validating, and testing the model's performance.It is instrumental in assessing the model's capabilities, generalization ability, and potential to make accurate predictions on unseen data.In this research, an assortment of datasets has been employed to evaluate the proposed model thoroughly.Below is a short summary of the datasets that have been used.

N-BaIoT
A well-known and frequently used dataset in the area of botnet identification, particularly for Internet of Things (IoT) devices, is the N-BaIoT dataset 16 .It was carefully created from nine commercial Internet of Things (IoT) devices that had really been compromised by the well-known botnets Mirai and BASHLITE.The dataset's emphasis on actual botnet infections in IoT devices makes it extremely pertinent for researching and comprehending the security issues unique to IoT ecosystems.IoT device vulnerabilities are frequently used by the Mirai and BASHLITE botnets to launch massive distributed denial-of-service (DDoS) assaults.With 116 features, the dataset delivers an enormous amount of information.These attributes, which are necessary for creating reliable botnet detection models likely include a wide range of characteristics, such as network traffic traits, communication patterns, and protocol-specific information.We took samples from the first IoT device out of the nine contaminated devices for our research.There were 1,018,298 samples in this subset.Table 2 lists the number of samples for each class.In the "Appendix" section, an explanation of each feature of the dataset is provided.

Bot-IoT
The BoT-IoT dataset is a valuable tool created to support research on botnet detection and IoT (Internet of Things) security.It was extensively developed in the Cyber Range Lab at UNSW Canberra to produce a realistic network environment 17 .The dataset stands out due to its accurate depiction of an IoT network environment.In this environment, which closely resembles real-world IoT system environments, there is a mix of effective (normal) and malicious (botnet) relations.46 features make up the dataset, and they offer crucial details regarding the peculiarities of network traffic.These features most likely consist of IoT-specific parameters, communication protocols, and other relevant information that help identify and categorize botnet activities.It is a comprehensive dataset that includes examples that are typical of many botnet kinds and covers a variety of botnet attacks.The information is carefully tagged to indicate the many types of botnet attacks that are there.Researchers and security professionals can create and assess models for IoT security, intrusion detection, and botnet mitigation using the BoT-IoT dataset.The dataset is a useful tool for comprehending and tackling security concerns in the developing Internet of Things space because it focuses on IoT devices and includes actual traffic scenarios.

CTU-13
Another well-known dataset for botnet identification and network security analysis is the CTU-13 dataset 18 .It originated in 2011 at CTU University with the main goal of supplying a sizable and varied collection of genuine botnet traffic mixed with regular background traffic.This dataset is crucial for testing and creating intrusion detection systems and botnet detection models.There are thirteen distinct captures or scenarios in the dataset.Each instance of botnet traffic represented by a scenario has particular characteristics, such as the type of malware utilized, the protocols employed, and the operations carried out.The dataset contains actual botnet traffic, which makes it useful for examining actual botnet behavior and comprehending the various strategies used by botnet operators.The CTU-13 dataset includes botnet traffic as well as regular and background traffic to replicate realistic network conditions.Researchers can create models that can differentiate between secure and harmful network activity due to this diversity.In our research, we employed about 1,525,249 samples, each of which was described by ten features.These properties of the network traffic, such as packet headers, flow statistics, or communication patterns, are mainly captured by these features.

ISCX
The ISCX dataset was designed to provide researchers access to a wide range of network traffic information for analyzing and discovering network intrusions, such as botnet activity.Data on network traffic was collected from a regulated network environment to construct the dataset.In addition to other sorts of attacks, such as but not limited to botnet attacks, it also contains benign (regular) traffic.It consists of 79 features that come from various facets of network traffic.These functions offer data on different network protocols, traffic patterns, and statistical characteristics.This dataset is frequently used, especially in the context of botnet detection, to assess the efficacy of intrusion detection models and approaches.The dataset is a great resource for researching botnet and network intrusion detection due to its comprehensiveness and labeled cases 19 .

CCC
The Cyber Clean Center (CCC) dataset is a commonly used publicly accessible dataset in the field of cybersecurity research.It has been used in many research studies to evaluate how well various models work at identifying and thwarting botnet-based assaults.The four separate sub-datasets in the dataset are C08, C09, C10, and C13.The CCC dataset includes traffic packets that are linked to particular port numbers within these sub-datasets.IRC (Internet Relay Chat) traffic is routed through port 6667, while HTTP (Hypertext Transfer Protocol) traffic is routed through port 80 45,46 .
This dataset, which consists of a total of 56 features, is a useful tool for developing and testing models intended to differentiate between botnet traffic and regular traffic.The dataset's test feature divides traffic into two categories normal traffic and botnet traffic providing a foundation for evaluating the effectiveness of various detection and prevention methods.

CICIDS
The CICIDS dataset 47,48 is a popular benchmark dataset used in the detection of botnets.It was created by the Canadian Institute for Cybersecurity and includes a significant amount of network traffic that was recorded in a genuine corporate network environment.The dataset contains both good and bad network traffic, including several botnet activities including DoS, port scanning, DDoS, and bot attacks.The dataset has been used to assess how well different deep learning and machine learning approaches can identify botnet traffic.Its use has resulted in the development of more accurate and efficient botnet detection systems that can help enhance the security of computer systems and networks.

Experimental setup
In the experimental phase of this research, we established a computational environment using an 11th Gen Intel(R) Core(TM) i7-11,700 processor with 16 GB of RAM, running a 64-bit version of Windows 11 Pro.Our analyses were conducted using Python within the Jupyter Notebook interface, leveraging the powerful Scikit-learn library to implement machine learning models.We harnessed an array of Python libraries and methods to implement and evaluate our model.Key libraries included pandas for data handling, matplotlib for data visualization, imblearn.over_samplingfor applying SMOTE to combat class imbalance, and sklearn.preprocessing for feature standardization and encoding with StandardScaler and LabelEncoder.Feature selection was conducted using sklearn.feature_selection'sSelectKBest with mutual_info_classif, as well as PCA from sklearn.decomposition for dimensionality reduction.The machine learning models were built using ensemble methods from Scikit-learn, such as ExtraTreesClassifier, XGBClassifier, RandomForestClassifier, BaggingClassifier, and StackingClassifier with decision tree and logistic regression base estimators.We also utilized SVC, DecisionTreeClassifier, GaussianNB, MLPClassifier, and XGBClassifier for a diverse set of predictive models.For model training and validation, the train_test_split method was integral in creating subsets of data.Specifically, we allocate only 20% of the data for testing purposes, reserving the remaining 80% for training.Model performance was quantified using a suite of evaluation metrics from sklearn.metrics, including accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, cohen_kappa_score, roc_auc_score, and roc_curve.System performance and efficiency were monitored through the time and psutil libraries, ensuring an accurate log of model training and prediction times along with system resource utilization.These tools and methods collectively formed the backbone of our experimental framework, facilitating a thorough investigation into the efficacy of our novel botnet detection approach 49 .

Data preprocessing
In this research, we utilized the mentioned Botnet Detection Datasets for our experiments.We started the preprocessing stage by checking for missing values and duplicates.Using the "duplicated()" function, we first examine the dataset for any duplicate entries.Duplicate rows are checked for and eliminated from the dataset.The "drop_duplicates()" function is used to achieve this, guaranteeing that each distinct data item is displayed only once.For many algorithms, infinity and extremely large values are difficult.Such values are substituted by NaNs (Not-a-Numbers) to address this problem.We substitute certain patterns, such as scientific notation or numeric strings, with NaN values using regular expressions.The dataset is then eliminated any rows with NaN values.To ensure data integrity and avoid mistakes during modeling, this step is essential.After that, label encoding is a process used to convert categorical labels into numerical values which is applied.Since most machine learning algorithms work with numeric data, categories must be mapped to numbers, making the data suitable for these models to process.The "LabelEncoder" class in scikit-learn performs this conversion by associating each unique category with a number.The Synthetic Minority Over-sampling Technique, commonly abbreviated as SMOTE, represents a robust approach to mitigate the imbalance in datasets by generating synthetic data points.The principal strategy of SMOTE involves creating new examples that expand the minority class, effectively augmenting the dataset without the repetition involved in traditional over-sampling.This process is performed by creating synthetic minority class instances in the feature space neighborhood of existing minority examples, thereby enriching the dataset's diversity and aiding in the creation of more generalizable models.Let us denote our feature space as X with n features, where X ∈ R m * n , and m is the number of instances.Correspondingly, the target variable y which denotes class labels is binary, with y i ∈ {0, 1} here 0 represents the majority class and 1 represents the minority class.The goal is to balance the dataset by altering the distribution of y such that y' approaches a uniform distribution.For each minority class sample x i , SMOTE calculates the k-nearest minority class neighbors.A synthetic instance is then generated by choosing one of the k-nearest neighbors, x nn , and constructing a new instance x new as follows: where is a random number between 0 and 1.This operation constructs a new sample point that lies on the line segment between x i and x nn in the feature space.The procedure can be formally expressed through the fol- lowing steps: i.For a minority class instance x i , identify its k-nearest neighbors using a chosen distance metric, typically Euclidean distance: ii. Randomly select k′ neighbors from the k-nearest neighbors, where k ′ ≤ k.
iii.Generate synthetic instances using the interpolation formula given above for each selected neighbor.iv.Repeat this process until the desired class proportion is achieved.The dataset balancing through SMOTE in this research not only enhances the representation of the minority class but also contributes to a broader and potentially more accurate exploration of the feature space, resulting in a richer hypothesis space for the subsequent learning algorithms.This iterative refinement of the training set aims to yield a machine learning model with improved generalization capabilities across the botnet detection domain.
Overall, our preprocessing stage ensured that the dataset was clean and ready for feature engineering.

Feature engineering
The feature engineering steps start with handling infinite values and large floats.All infinite values with NaN (Not a Number), making sure that no infinite or unreasonably large values that could skew the data are included.The regular expression targets strings formatted as floats (both in standard decimal and scientific notation) and integers, replacing them with NaN, which appears to be a mistake since it may inadvertently convert all numeric values to NaN.After replacing certain values with NaN, rows containing these NaNs are dropped from the dataset, ensuring the dataset has no missing or infinite values which could potentially cause errors during modeling.The "StandardScaler" normalizes the numerical columns by subtracting the mean and scaling to unit variance.This standardization of features is important since it ensures that each feature contributes equally to the distance computations in machine learning algorithms.Normalization of numerical columns through the "StandardScaler" involves rescaling the distribution of values so that the mean of observed values is 0 and the standard deviation is 1.This process is known as Z-score normalization and can be mathematically represented as follows: Let X be a matrix where each column represents a feature and each row represents an observation.The standard score (Z-score) for a single scalar value in a feature column can be calculated using the formula: where x is the original value, μ is the mean of the feature column, σ is the standard deviation of the feature column.
For a X j feature column with n observations, the mean µ j is calculated as: (1) The standard deviation σ j for the same feature column is given by: The "fit_transform" method of "StandardScaler" first computes µ j and σ j for each feature in the fit phase.It then applies the above Z-score normalization to each element of the feature column in the transform phase, producing a new matrix where all features are standardized.
The processed dataset is now ready for the selection of relevant features.

Feature selection and combining relevant features
In order to distinguish between botnet traffic and regular traffic, feature selection techniques including CA, MI, and PCA can assist discover the most organization capability in the dataset.CA 50 measures the linear relationship between two variables.In botnet detection, it can identify features that have a high correlation with the target variable, which can help differentiate between botnet and normal traffic.MI 51 is a statistical measure that quantifies the amount of information that one feature provides about another.Even if the connection is nonlinear, it may be used to find characteristics in botnet detection that have a substantial correlation with the target variable.By transferring the data to a lower-dimensional space, the PCA approach decreases the dimensionality of the data.By locating the directions in the data that have the most variation, it can assist in identifying the most crucial properties.It can assist in botnet identification by locating the dataset's most important properties, which can help distinguish between botnet traffic and regular traffic 52 .Algorithm 1 illustrates the algorithm for the selection of relevant features for the detection of botnets. (4) Inputs-X: Feature matrix with dimensions [ , ], y: Target labels with dimensions [ ], k: Number of features to select (e.g., 20) Output-: Set of selected relevant features 1. Correlation Analysis: i.
Calculate the Pearson correlation coefficient matrix for the feature matrix X: -µ ), X is the feature matrix, is the number of samples, and µ is the mean of each feature ii.
Compute the absolute values of the correlations: Identify features with correlations greater than a threshold (e.g., 0.5) as relevant, and store them in 2. Mutual Information: i.
Calculating mutual information between each feature and the target variable y: , ( , ) is the joint probability distribution, and ( ) and ( ) are marginal probabilities.ii.
Selecting the top k features with the highest mutual information scores and store them in .3. Principal Component Analysis (PCA): i.
Perform PCA to reduce the dimensionality of X to (e.g., 20) principal components.ii.
PCA finds the eigenvectors 1 , 2 ,…… that correspond to the largest eigenvalues of the covariance matrix: iii.
Determine which original features contribute most to each principal component and store them in .4. Combine Relevant Features: i.
Features combination obtained from correlation analysis, mutual information, and PCA.= ∪ ∪ 5.Return as the selected set of relevant features for botnet detection.www.nature.com/scientificreports/

Algorithm 1 Selection process of relevant features
The three different feature selection techniques selects the most relevant features for botnet detection, which allowed us to achieve better performance in subsequent stages of our research.

Ensemble method selection
Among the many ensemble approaches extra trees, extreme gradient boosting, bagging, random forest, and stacking are considered to evaluate the evaluation metrics in this research.Here is a brief overview of these ensemble methods:

Model with improved extra trees ensemble method
The Extra Trees ensemble classifier is a powerful machine learning technique for classifying traffic into multiple types of botnets or normal categories.It leverages the strength of multiple decision trees to achieve accurate and robust predictions.In the Extra Trees ensemble classifier, each decision tree is trained on a randomly selected subset of relevant features from the input dataset.The split points at each node are chosen based on maximizing the information gain or reducing the impurity criterion, such as Gini impurity or entropy 53,54 .
The model with the Extra Trees ensemble classifier effectively categorizes traffic instances into multiple types of botnets or normal classes.It can handle the complexity and diversity of different botnet types and generalizes well to new, unseen data, making it a versatile and reliable tool for traffic classification tasks.The process for the detection of botnets using extra trees ensemble is given in Algorithm 2.
In the above algorithm, I G (S) is the Gini impurity for a set S, with p positive instances and n negative, D x, f , t s is a function with data point, feature, and threshold.D(x, feature, threshold) returns "left" if x[feature] ≤ threshold and "right" otherwise.I G,left is the gini impurity for the left child node and I G,right is the right child node based on feature tests.
The reason we chose to present the detection process in Algorithm 2 is due to the superior performance of the model with the Extra Trees ensemble.However, for the other ensemble-based approaches, we have provided the hyperparameters within the respective classifier descriptions.

Bagging ensemble method
The Bagging ensemble classifier is a powerful machine learning technique employed for classification tasks, designed to enhance predictive performance and reduce overfitting.It achieves this by creating an ensemble of multiple base classifiers, each trained on different bootstrap samples randomly drawn from the training dataset.The final prediction is determined through majority voting, where each base classifier's output contributes to the ultimate decision 37,55 .
For a classification problem like botnet detection, the final prediction ŷ in Bagging ensemble is obtained through majority voting: where ŷ represents the final predicted class label (classification output) and h t (X) denotes the prediction of the t-th base classifier for the input features X.Each base classifier h t produces a class label prediction based on the input features.
The number of base classifiers in the ensemble is used by n_estimators = 10.The size of bootstrap samples used for training by max_samples = 1.0, indicating that each classifier is trained on bootstrap samples of the same size as the original training set.The number of features considered during training for each base classifier with max_features = 1.0, all features are utilized during training.A boolean value determines whether bootstrap sampling is enabled (True) or not (False) with bootstrap = True, allowing bootstrap sampling.A boolean value indicates whether feature bagging is applied (True) or not (False).We used, bootstrap_features = False, to imply that feature bagging is not utilized.A random seed is used for reproducibility whose value is set to 42.
By harnessing the diversity of the base classifiers and reducing variance, Bagging provides an ensemble model that excels in capturing complex patterns within the data and yields more robust predictions compared to individual classifiers.The scikit-learn implementation, with its adjustable hyperparameters, offers flexibility in controlling the ensemble's size, feature selection, and randomization process, empowering users to optimize the classifier for various classification tasks 56 .

Boosting ensemble method
XGBoost stands as a paragon of ensemble learning, renowned for its efficiency and prowess in handling diverse datasets with a sophisticated boosting mechanism 57 .It epitomizes an advanced gradient-boosting framework, employing a constellation of decision trees to progressively minimize errors and improve predictive accuracy, making it an invaluable tool for tackling intricate classification challenges such as botnet detection.
The XGBoost classifier's training process for botnet detection utilizes an ensemble of decision trees constructed iteratively to correct the predecessors' errors.The objective L at each iteration combines the loss function used to measure the prediction's discrepancy from the true value and a regularization term to penalize model complexity, formally described as: where y i represents the true label for instance i, ŷi is the corresponding prediction, and f k represents each deci- sion tree in the ensemble.The regularization term f k is defined as: with T denoting the number of leaves in tree f k , ω the scores on leaves, γ he complexity cost added for each additional leaf, and the L2 regularization term on the leaf weights.The hyperparameters utilized in the given code segment specify a learning rate (or "eta") of 0.1, a fixed number of 10 estimators for the number of boosting rounds, and a max depth of 3 for individual trees.These hyperparameters dictate the learning trajectory.

Random forest ensemble method
To enhance the model's overall performance, ensemble approaches mix numerous models.The ensemble approach is employed in this instance to detect botnets using the Random Forest algorithm.The number of estimators, the criterion, the minimum samples split, the maximum depth, the minimum samples leaf, and the maximum features are the initial hyperparameters for the Random Forest classifier.With this ensemble approach, the values of the evaluation metrics are likewise checked using the same four classifiers 40 .
Input: selected relevant features, ( ) 1. Initialize the Extra Trees Classifier i.
Define the number of decision trees in the ensemble as = 10.ii.
Set a random seed for reproducibility using = 42. iii.
Determine the number of features to consider for splitting at each node using .

Training the Model i.
For each decision tree tt in the ensemble ( = 1, 2,…, ): a. Randomly select a subset of features ( ) from the relevant feature .b. Build the decision tree t by recursively selecting the best feature (f) and split point (s) at each node to minimize Gini Impurity.c.At each node, calculate the Gini Impurity for the current set of data using the equation: For each candidate split (D left ,D right ) based on feature f and threshold , calculate the weighted Gini Impurity using the equation: e. Choose the split ( ) that minimizes the weighted Gini Impurity: ( ) = arg min f.Check if the stopping criterion is met for the current node.
i.The number of samples in the current node falls below a predefined threshold ( = 2).In this case, no further splits are allowed, and the current node becomes a leaf node.ii.The ( ) is set to 1, each node is allowed to become a leaf node with a single sample.g.If the stopping criterion is met, mark the current node as a leaf node and assign it the class label that is most prevalent among the remaining samples in the node.h.If the stopping criterion is not met, continue building child nodes by splitting the current node based on the selected split ( ).

Classification i.
To classify a new traffic data point x, traverse each tree t in the ensemble from the root node to a leaf node.ii.
At each node, apply the decision function ) to determine the branch to follow.iii.
Accumulate the class labels ŷ t obtained from all trees.iv.
Use majority voting, as defined by the equation below, to determine the final predicted class label ŷ for x: If ŷ = 0, classify the input traffic as "normal", otherwise classify the type of "botnet".www.nature.com/scientificreports/For a given feature set X and labels y, the Random Forest classifier aims to construct n estimators, each a decision tree denoted by f (x, � k ) , where, k are the random vectors independently sampled with the same distribution for all trees and k indexes through the ensemble of trees.The predictive function F(x) for a new sample x is obtained by averaging the predictions from all individual trees: The criterion "gini" refers to the Gini impurity, a measure of the frequency at which any element of the dataset will be mislabeled when it is randomly labeled according to the distribution of labels in the subset.Random Forest minimizes the total Gini impurity of individual trees.
In addition, our implementation initializes a Random Forest classifier called "rf " with 10 decision trees in this research.The default settings for the remaining hyperparameters have been chosen.When "max depth" is set to "None", trees grow until all leaves are pure or have less samples than "min samples split" (whichever comes first).When "min samples split" is set to 2, a node is only split if it has at least two samples.Since "min samples leaf " is set to 1, a leaf node must include at least one sample.All samples are equally weighted since "min weight fraction leaf " is set at 0.0.When "max features" is set to "auto" the number of features that may be utilized at each split is limited to a value that equals the square root of the total number of features.When "bootstrap" is set to True, each decision tree is constructed using bootstrap samples.The out-of-bag score is not computed since "oob score" is set to False."n_jobs" is set to "None", which means that all CPUs are used."random_state" is set to 42, which means that the same random state is used every time the classifier is trained."verbose" is set to 0, which means that no messages are displayed during the training process."warm_start" is set to False, which means that the previously trained trees are not used to initialize the new training."class_weight" is set to "None", which means that all classes are weighted equally."ccp_alpha" is set to 0.0, which means that no pruning is performed.Finally, "max_samples" is set to "None", which means that all samples are used to train each decision tree.

Stacking ensemble method
Training a meta-classifier to aggregate the predictions of various base classifiers is a step in the ensemble learning process known as stacking 58 .The meta-classifier chosen is Logistic Regression for this research, which is a linear classifier used to estimate probabilities of binary outcomes 39 .The Stacking classifier is created using a list of individual classifiers which are the same as the previous ensemble methods.Each individual classifier is trained on the same data and is expected to provide a different prediction.The Stacking ensemble classifier then combines these predictions from each individual classifier and utilizes the meta-classifier to arrive at a final prediction.The solver parameter specifies the algorithm used to optimize the loss function.In the implementation of this method LogisticRegression(solver = 'lbfgs' , random_state = 42) is used, "random_state = 42" sets the random seed to 42, so the results of the logistic regression model will be reproducible if the code is run again with the same value of 42 for "random_state".This is essential for testing, troubleshooting, and contrasting various models.The "lbfgs" solver, which stands for Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm, is employed in this situation.It is a quasi-Newton approach that seeks for the minimum of the loss function by approximating the Hessian matrix, which is the second derivative of the loss function 59 .

Assessment metrics employed to evaluate the model's effectiveness
Utilizing a variety of assessment criteria, we have demonstrated the efficacy of our suggested approach in detecting botnets.These metrics are succinctly outlined as follows: Accuracy A frequent assessment statistic in machine learning to assess the effectiveness of a classification model is accuracy.It evaluates the percentage of correctly categorized cases among all instances 60 .
The accuracy is computed by dividing the total number of occurrences (TP + TN + FP + FN) by the number of cases that were properly classified here True Positives (TP) (instances that are actually positive and are correctly predicted as positive), True Negatives (TN) (instances that are actually negative and are correctly predicted as negative), False Positives (FP) (instances that are actually negative but are predicted as positive), False negatives (FN) Situations where a positive outcome is predicted but truly happens).

Precision
Another frequently used evaluation parameter in machine learning to evaluate the efficacy of a classification model is precision.Of all occurrences that were expected to be positive, it calculates the percentage of genuine positives.As it implies a larger proportion of real positives among all instances predicted as positive, a higher accuracy score is preferable.Precision levels vary from 0 to 1, with 0 indicating no precision and 1 representing complete precision (all cases predicted as positive are really positive) (all instances predicted as positive are actually negative) 60 .
The accuracy is computed by dividing the total number of instances predicted as positive (TP + FP) by the number of true positives (TP): The recall is derived by dividing the total number of real positive instances (TP + FN) by the number of true positives (TP): A higher recall value is better, as it indicates that the model is correctly identifying more actual positive instances.

F1-score
Machine learning uses the F1-score as an assessment parameter to evaluate the trade-off between recall and accuracy.Its range is 0 to 1, with 1 indicating the ideal balance between precision and memory.It is the harmonic mean of precision and recall 61 .
Calculating the F1-score involves dividing the total of the accuracy and recall outputs by their product: Cohen's Kappa Cohen's Kappa is frequently used to assess how well a classification model is working.The observed agreement between the model's predictions and the actual labels is compared to the predicted agreement that would happen by chance to determine Cohen's Kappa.A model is considered to be effective if the observed accuracy value exceeds the expected accuracy 62 .The formula for Cohen's Kappa is as follows: here Po: the percentage of actual labels that agree with predictions from the model, Pe: the percentage of anticipated concordance between the model's predictions and the actual labels as determined by chance.

The values of Po and Pe can be calculated as follows:
Area under the ROC curve Machine learning models frequently use the performance metric AUC 63 .The TPR vs FPR for various categorization criteria is plotted on the ROC (Receiver Operating Characteristic) curve.
The area under the ROC curve, or AUC, is a binary classification problem's measure of how separable the positive and negative classes are.The AUC goes from 0 to 1, where 0 denotes a model that is flawlessly awful and consistently predicts the incorrect class and 1 denotes a model that is flawlessly good and consistently predicts the right class.A random model with an AUC of 0.5 is no better than guessing 64 .The AUC score is calculated by the Eq.(9).where s p is the sum of all the ranked positive instances, and n p and n n stand for the number of positive and negative examples, respectively.
A high AUC shows that the model is good at differentiating between positive and negative classes.Because it is indifferent to both the selected classification threshold and the class imbalance in the data, it is a valuable statistic.AUC may also be used to compare how several categorization models perform on the same dataset.(

Result and discussion
In the following section, we evaluate the effectiveness of the approach we suggested for detecting botnets.In order to visualize the model, we first use the N-BaIoT dataset and the extra trees ensemble technique.The "Appendix" section contains a full description of each of the 116 features that make up the N-BaIoT dataset.We use 1,018,298 initial samples in total from this dataset for our experiment.
The label column in the N-BaIoT dataset encounters eleven different types of values, including benign, gafgyt_ junk, mirai_udp, mirai_syn, mirai_scan, gafgyt_udp, mirai_ack, gafgyt_tcp, mirai_udpplain, and gafgyt_combo.Among these, "benign" indicates normal flows, whereas each of the other categories is botnets.Using a variety of evaluation criteria, comparisons to other datasets, use of various ensembles, we examine the model's performance.
To record the outcomes of the model's application across diverse datasets, we employ the identical procedures elaborated extensively in the section titled "Proposed Model Development", By changing simply the datasets, the model's effectiveness is being evaluated together with its potential for botnet detection.As a result, we evaluate the effectiveness of our model's botnet detection by comparing it with other botnet detection models that are currently available.

Analysis of the findings
The dataset's class distribution is illustrated in Fig. 2 and includes eleven different class types.The percentage of each class is shown within the relevant pie slice, which represents each class as a slice.The picture gives a fast grasp of the relative proportions of each class by visualizing how the dataset is distributed throughout the various classifications.
After category labels were converted into numerical representations using the LabelEncoder(), the classes are shown in Table 2.A distinct integer number between 0 and 10 is given to each class.The table provides a concise representation of the original class labels and their corresponding encoded values.
Following the application of the SMOTE technique, each category within the target column is represented by 237,665 samples, achieving a balanced dataset.Figure 3 presents the confusion matrix derived from evaluating the model on test data (20%).Each cell in the heatmap represents the count of predictions made by the model for each class.The y-axis displays the actual labels, while the x-axis displays the predicted labels.Annotating each cell with the respective count provides a clear and comprehensive overview of the model's performance and the distribution of predictions across different classes.Based on the computations in the figure and Table 3, the model performs effectively as shown by the large number of True Positives (TP) in various classes.The considerable counts of True Positives indicate that the model successfully identified and labeled instances belonging to their respective classes, such as gafgyt_combo, gafgyt_junk, gafgyt_scan, mirai_ack, and others.
Examining the False Positives (FP), which represent instances incorrectly classified as positive, we observe relatively low counts across most classes.This suggests that the model has a reasonably low rate of misclassifying instances as positive when they actually belong to a different class.The False Negatives (FN), show minimal counts or even zero for most classes.This implies that the model has a strong ability to correctly identify instances that do not belong to a specific class.In addition, the True Negatives (TN), which represent instances correctly classified as negative, we observe high counts across the majority of classes.This indicates the model's proficiency in accurately identifying instances.
According to the confusion matrix, the average True Positive Rate (TPR) of 99.99% indicates that the model has a nearly perfect ability to correctly identify positive instances across all classes.Moreover, the average False Positive Rate (FPR) of 0.00% suggests that the model has an extremely low rate of falsely classifying negative instances as positive.
The evaluation metrics for each class in our multiclass classification model for various forms of botnets are shown in Table 4.The model performs exceptionally well across all classes, showing excellent values for F1-score,   www.nature.com/scientificreports/recall, accuracy, and precision, all at 1.0000 (100%).This means that the model successfully categorizes each class with almost perfect accuracy, demonstrating a very trustworthy and powerful prediction capacity.Due to its extraordinary effectiveness, it is a trustworthy and beneficial tool for multiclass classification-based botnet identification in a variety of real-world circumstances.
The evaluation metrics (weighted) for the various ensemble approaches for botnet detection employed in this research are shown in Table 5.The research compares Extra Trees, Random Forest, Bagging, Extreme Gradient Boosting, and Stacking as the five ensemble approaches.The model with the Extra Trees Ensemble approach stands out as the best performer for botnet identification among the ensemble methods examined.It has almost flawless scores in Accuracy, Precision, Recall, and F1-score, demonstrating an amazing capacity to correctly identify botnet occurrences while reducing false positives.Notably, the Extra Trees model has a remarkable FPR    6 as assessment metrics.The model using the Extra Trees Ensemble technique has the greatest Balanced Accuracy (BACC) score of 0.9999, demonstrating its ability to correctly identify instances as belonging to botnets or not even when the data is unbalanced.The model's accuracy in creating accurate predictions is demonstrated by the incredibly low Error Rate of 0.0000, underscoring its dependability for botnet-detecting jobs.A Training Accuracy of 1.0000, indicating the Extra Trees approach's capacity to precisely match the training data, is also achieved.Additionally, the high Testing Accuracy score of 0.9999 suggests that the model can generalize well to unseen data, making it a robust solution for real-world botnet detection scenarios.
The average AUC Score, Cohen's Kappa, Observed Accuracy (Po), and Expected Accuracy (Pe) for several ensemble approaches are shown in Table 7 as other assessment metrics.With a Cohen's Kappa of around 0.9999, the model with the Extra Trees Ensemble approach shows outstanding agreement between its predictions and the actual classifications.Its outstanding capacity to precisely detect botnets is indicated by its high Observed Accuracy (Po) and AUC Score of 0.9999 and 1.0000, respectively.Additionally, the Expected Accuracy (Pe) is substantially lower at 0.0909, emphasizing that the model's performance is significantly better than random chance.The findings highlight the model with the Extra Trees model's potential as a powerful and reliable tool for identifying botnets, making it an excellent option for cybersecurity applications.
Overall, the proposed model with the Extra Trees Ensemble approach shows to be an excellent choice for botnet identification based on the assessment criteria shown in the Tables 5, 6 and 7.Because of its remarkable accuracy, precision, and generalization abilities, it has the potential to be an effective tool for enhancing cybersecurity and preventing the widespread issue of botnets.
Figure 4 shows the ROC Curve of TPR against FPR for various classes.The TTPR and the FPR connection at various thresholds is depicted in this picture by a ROC curve.With an AUC score of 1, the model is considered to have almost flawless classification performance and very high discriminatory power.According to the ROC curve, the model is very capable of differentiating between positive and negative classes.
We may therefore draw the conclusion that the suggested model with extra trees is very accurate and can successfully and confidently detect botnet attacks.A precision-recall curve, shown in Fig. 5, illustrates the trade-off between accuracy and recall for various threshold levels.This curve visually illustrates the alterations in precision and recall values when adjusting the decision threshold for classifying samples.On the x-axis, we observe the recall, which is synonymous with the true positive rate or the percentage of actual positive cases correctly identified as positive by the model.This graphical representation offers insights into the model's capacity to differentiate between distinct classes, enabling a side-by-side comparison of precision and recall values for each class.
Since the precision and recall are both 99.99%, the curve should show a high precision and recall rate throughout, indicating that the model is performing very well.A perfect precision-recall curve would be a right angle, going up to the top at the perfect recall of 1, then continuing horizontally at the perfect precision of 1.If the curve is a straight line, it would mean that the model's precision and recall rates are consistent throughout different threshold values.Therefore, based on the precision-recall curve, we can infer that the model is performing exceptionally well, achieving very high precision and recall rates for different threshold values.
Figure 6 presents a graphical representation illustrating the accuracy variations of an Extra Trees Classifier model across various ensemble sizes.On the x-axis, you can observe the number of trees in the ensemble,  Based on the findings, it's evident that as the number of trees in the model increases, both the training accuracy and test accuracy converge towards a value of 1.0, or very close to it.This suggests that the model is proficient at accurately classifying the data.The training accuracy begins at roughly 0.99 and swiftly reaches 1.0 when employing 10 trees as estimators, maintaining a similar level with further increases in estimators.This underscores the model's exceptional fit to the training data, achieving near-flawless accuracy.Simultaneously, the test accuracy commences at a high level of approximately 0.99 and gradually nears 1.0 as the number of trees grows.This indicates that the model demonstrates robust generalization to new, unseen data, as the accuracy on the test set consistently improves with the addition of more trees.
Overall, the generated figure demonstrates that the Extra Tree Classifier model is highly effective for the detection of botnets.The accuracy values are consistently high, both on the training and test sets, indicating that the model is able to accurately classify instances of botnet behavior.

Comparison of the proposed models' evaluation metrics against the ones of recent models
The proposed model displays outstanding performance on multiple datasets comparing existing models, as depicted in Table 8.The table only contains published models with datasets covering the years 2020-2023.On the N-BaIoT dataset, the model achieves near-perfect accuracy, recall, precision, and F1-score, all reaching 99.99%.This exceptional accuracy demonstrates the model's ability to accurately classify both positive and negative instances, making it highly reliable for detecting botnets in this dataset.Similarly, on the Bot-IoT dataset, the proposed model achieves perfect scores of 100.00% in all evaluation metrics, indicating its remarkable precision and recall in distinguishing botnet activities from normal traffic.The proposed model consistently delivers remarkable results not only on the N-BaIoT and Bot-IoT datasets but also on other datasets, including CTU-13, ISCX, CICIDS, and CCC.Its exceptional performance across diverse datasets underscores its effectiveness in detecting botnet activities and reinforces its position as a powerful and reliable solution for enhancing cybersecurity measures.Comparing the proposed model with other existing models on the same datasets reveals its superiority in botnet detection.
The proposed scheme for botnet detection surpasses other existing approaches for all evaluation metrics and displays impressive performance across a variety of datasets.Its outstanding results validate its effectiveness in detecting botnet activities and highlight its potential as a robust solution for enhancing cybersecurity measures against botnet threats.

The proposed models accuracy with computational complexity for various datasets
Table 9 presents the results of the model on different botnet datasets.The table provides accuracy scores and other performance metrics including TPR, Training Accuracy, Testing Accuracy, and AUC Score.The training accuracy is consistently high, being 100.0%for all datasets, suggesting that the model has learned the training data very well and does not suffer from overfitting.The testing accuracy is also very high, indicating that the model generalizes effectively to unseen data.The TPR or sensitivity is generally high, ranging from 98.50% to 100.0%.TPR represents the model's ability to correctly identify positive instances (botnets) from the total number of actual positives.High TPR values indicate that the model is successful in identifying botnets.The AUC score, which represents the model's overall ability to discriminate between positive and negative instances, is consistently high, with values ranging from 99.00 to 100.0%.A high AUC score suggests that the model is highly capable of distinguishing between botnet and non-botnet traffic.In our research, the underlying algorithm of focus is the ExtraTreesClassifier leverages an ensemble of decorrelated decision trees, which introduces both variance reduction and an increase in computational complexity.The core of the ExtraTreesClassifier's complexity lies in the construction of N decision trees, where each tree is built from a bootstrap sample of the training data.The complexity of constructing a single decision tree is O M * K * logK , where M represents the number of training samples and K is the number of samples used at each node to determine the best split.In the case of the ExtraTreesClassifier, since the splits are chosen randomly and not from all possible splits, the complexity is reduced to O M * logK .In our experiments, we account for the feature selection process preceding the application of the ExtraTreesClassifier.The selection techniques employed, such as mutual information and PCA, add their own computational costs, which are respectively O(M * N)andO N 2 , where N refers to the number of features.However, as these steps reduce the dimensionality of the problem, they can actually lead to a reduction in the subsequent computational burden during the training of the classifier.Table 10 delineates the computational expenditure entailed in deploying an ExtraTreesClassifier across a spectrum of botnet datasets, illustrating the model's efficiency and resource utilization.It provides a meticulous account of the temporal demand for both the training and testing phases, measured in seconds, alongside a quantification of the memory resources requisitioned during these stages, presented in megabytes (MB).This comprehensive portrayal aids in discerning the practical implications of model deployment in varied operational environments, thereby facilitating informed decisions regarding the balance between computational costs and the performance efficacy of the model.The ExtraTreesClassifier presents a trade-off between computational complexity and prediction accuracy.Despite the apparent high theoretical complexity, practical implementations benefit significantly from parallel computations.Our experimental setup and evaluations are designed to reveal the nuanced relationship between algorithmic complexity and model performance, ultimately guiding the user in selecting an appropriately balanced model for their specific real-world tasks.
Figures 7 and 8 offer a compelling visualization of the interpretability aspect of our extra trees ensemble model's predictions through the use of SHAP (SHapley Additive exPlanations) values.SHAP values provide a powerful framework for interpreting machine learning models by assigning an importance value to each feature for a particular prediction.
In Fig. 7, we observe a stacked bar chart detailing the mean SHAP values for different features across multiple classes.This visualization effectively communicates the average impact of each feature on the model output magnitude, allowing us to appreciate the relative importance of each feature in the classification process.The varied color coding corresponds to distinct classes, emphasizing the differentiated influence each feature exerts across various predicted classes.This nuanced depiction of feature impact is crucial in understanding how the extra trees ensemble model processes inputs to arrive at its decisions, enhancing trust in its predictions.
Figure 8 further enriches our interpretability discourse by presenting a SHAP value summary plot for a single prediction.This plot illustrates how each feature value contributes to the deviation from the base value (the model's output value if no features had an effect) for a specific instance.Features pushing the prediction higher are shown in red, while those contributing to a lower prediction are in blue, providing a clear, intuitive understanding of the directionality and magnitude of each feature's effect.
The positive aspect of these figures lies in their ability to convey the complexities of our ensemble model's decision-making in an accessible and informative manner.Through these visualizations, we not only affirm the robust predictive power of our model but also its transparency, allowing for a deeper insight into the 'why' and 'how' behind its predictions.This interpretability is instrumental in validating the model's decisions and establishing a foundation for trust with end-users, making it an indispensable feature for real-world applications where understanding model predictions is as critical as their accuracy.
The proposed approach to feature selection is robust and multi-faceted, integrating correlation analysis, mutual information, and principal component analysis (PCA).This trifecta of techniques ensures a comprehensive understanding of feature relevance, capturing a broad spectrum of data characteristics that extend beyond the specifics of the training dataset.Such an inclusive feature selection methodology can be instrumental in identifying latent attributes indicative of botnet activity, thereby positioning the model to better generalize to    www.nature.com/scientificreports/success demonstrates the effectiveness of the suggested methodology in accurately identifying botnets in diverse scenarios, and it holds the potential to detect new and previously unseen types of botnets.Consequently, the model's capabilities offer a substantial enhancement to computer system and network security by strengthening defenses against evolving botnet attacks.

Conclusion
In the vanguard of cybersecurity, the escalation of botnet sophistication presents a formidable challenge, one that demands an equally evolved countermeasure.Our investigation, rooted in a pioneering blend of hybrid feature selection techniques and the prowess of the Extra Trees ensemble classifier, heralds a new epoch in botnet detection strategies.The model we have meticulously engineered not only achieves an unprecedented accuracy exceeding 99.99% but also maintains an exceptional True Positive Rate (TPR) of 99% alongside a virtually nonexistent False Positive Rate (FPR) of 0.00%.The result provides the model's unparalleled discernment in distinguishing between benign and malicious network behaviors.Further bolstered by the robustness across diverse datasets, the model stands as a paragon of versatility.Its consistent performance in various scenarios underscores its utility as a formidable tool in the real-world arsenal against cyber threats.The harmonious confluence of precision, recall, and F1-scores as evinced in our evaluation metrics bespeaks a balanced approach that mitigates the risks of overfitting while ensuring the retention of predictive power.
Network administrators and cybersecurity defenders can leverage this sophisticated detection model as a vigilant sentinel, guarding the sanctity of digital infrastructures.The adoption of this scheme promises a substantial elevation in network security and a stride forward in obviating the vulnerabilities that plague our interconnected systems.
Despite the pronounced efficacy of our model, we acknowledge certain limitations that pave the way for future enhancements.The integration of real-time analysis and the exploration of deep learning architectures could offer substantial improvements, bridging the gap between static detection and dynamic, adaptive defense mechanisms.Consequently, our future research trajectory is poised to not only iterate upon this framework but to revolutionize it, ensuring that it remains at the forefront of cybersecurity innovation.
Training accuracy = Correctly predicted instances/total instances in training data (19) Testing accuracy = Correctly predicted instances/total instances in test data

Figure 2 .
Figure 2. Class distribution of the initial dataset.

Figure 6 .
Figure 6.Accuracy of the model for different no. of trees.
new botnet variants.Moreover, our model harnesses the strength of an ensemble-based methodology, utilizing the ExtraTreesClassifier. Ensemble techniques, by their very nature, amalgamate insights from multiple decision trees, reducing the risk of overfitting to training data and increasing the chances of detecting previously unseen patterns.The ExtraTreesClassifier, in particular, employs a randomized and high-variance approach to feature selection and splits within individual trees, which provides a breadth of perspectives on the data.This diversity in model architecture makes it more adept at identifying outlying behaviors that could signify emerging threats.It is also noteworthy to highlight that ensemble methods have been shown to be effective in handling non-stationary environments, due to their capacity to build a consensus across varied learners.The performance of our model on multiple botnet datasets, as demonstrated in our research, speaks to its precision and generalizability, two attributes that are crucial for detecting new forms of botnet activities.Overall, the proposed botnet detection model, incorporating hybrid feature selection and the extra trees ensemble classifier, consistently achieved high accuracy scores across various publicly available datasets.This

Figure 7 .
Figure 7. Mean SHAP values for feature importance across classes in ensemble model predictions.

Figure 8 .
Figure 8. Detailed SHAP value impact on a single model prediction.
on packet sizes at time lag 5 s HpHp_L0.01_stdStandard deviation of higher-layer protocol payload sizes in both directions at time lag 0.01 s HH_jit_L1_mean Mean of jitter on packet sizes at time lag 1 s MI_dir_L1_mean Mean of the direct mutual information between features at time lag 1 s H_L3_variance Variance of packet sizes at time lag 3 s HH_L5_radius Radius of the auto-correlation of packet sizes at time lag 5 s H_L1_mean Mean of packet sizes at time lag 1 s HH_L0.1_meanMean of the auto-correlation of packet sizes at time lag 0.1 s HH_L3_covariance Covariance of the auto-correlation of packet sizes at time lag 3 s HpHp_L0.1_weightWeight of higher-layer protocol payload sizes in both directions at time lag 0.1 s HH_L0.01_stdStandard deviation of the auto-correlation of packet sizes at time lag 0.01 s HH_L3_std Standard deviation of the auto-correlation of packet sizes at time lag 3 s HH_L1_magnitude Magnitude of the auto-correlation of packet sizes at time lag 1 s HpHp_L5_pcc Pearson correlation coefficient between higher-layer protocol payload sizes at time lag 5 s HpHp_L1_covariance Covariance of higher-layer protocol payload sizes in both directions at time lag 1 s HH_jit_L3_mean Mean of jitter on packet sizes at time lag 3 s HH_jit_L0.1_meanMean of jitter on packet sizes at time lag 0.1 s HpHp_L3_pcc Pearson correlation coefficient between higher-layer protocol payload sizes at time lag 3 s HH_L3_weight Weight of the auto-correlation of packet sizes at time lag 3 s MI_dir_L1_variance Variance of the direct mutual information between features at time lag 1 s HH_jit_L0.01_varianceVariance of jitter on packet sizes at time lag 0.01 s HpHp_L1_magnitude Magnitude of higher-layer protocol payload sizes in both directions at time lag 1 s HH_L1_mean Mean of the auto-correlation of packet sizes at time lag 1 s HH_L3_magnitude Magnitude of the auto-correlation of packet sizes at time lag 3 s HpHp_L0.01_magnitudeMagnitude of higher-layer protocol payload sizes in both directions at time lag 0.01 s HH_L0.01_radiusRadius of the auto-correlation of packet sizes at time lag 0.01 s HH_L1_weight Weight of the auto-correlation of packet sizes at time lag 1 s HH_L1_std Standard deviation of the auto-correlation of packet sizes at time lag 1 s HpHp_L5_mean Mean of higher-layer protocol payload sizes in both directions at time lag 5 s HpHp_L0.01_weightWeight of higher-layer protocol payload sizes in both directions at time lag 0.01 s H_L0.01_meanMean of packet sizes at time lag 0.01 s H_L0.1_weightWeight of packet sizes at time lag 0.1 s MI_dir_L3_mean Mean of the direct mutual information between features at time lag 3 s MI_dir_L0.01_weightWeight of the direct mutual information between features at time lag 0.01 s HH_L5_weight Weight of the auto-correlation of packet sizes at time lag 5 s H_L0.01_varianceVariance of packet sizes at time lag 0.01 s HpHp_L0.1_pccPearson correlation coefficient between higher-layer protocol payload sizes at time lag 0.1 s HH_jit_L3_weight Weight of jitter on packet sizes at time lag 3 s HH_jit_L5_weight Weight of jitter on packet sizes at time lag 5 s MI_dir_L0.01_varianceVariance of the direct mutual information between features at time lag 0.01 s MI_dir_L5_weight Weight of the direct mutual information between features at time lag 5 s HpHp_L0.1_magnitudeMagnitude of higher-layer protocol payload sizes in both directions at time lag 0.1 s HH_L3_radius Radius of the auto-correlation of packet sizes at time lag 3 s HpHp_L5_radius Radius of higher-layer protocol payload sizes in both directions at time lag 5 s Vol:.(1234567890)Scientific Reports | (2023) 13:21207 | https://doi.org/10.1038/s41598-023-48230-1www.nature.com/scientificreports/Feature title Description HH_L0.01_weightWeight of the auto-correlation of packet sizes at time lag 0.01 s H_L3_weight Weight of packet sizes at time lag 3 s MI_dir_L5_variance Variance of the direct mutual information between features at time lag 5 s HpHp_L3_weight Weight of higher-layer protocol payload sizes in both directions at time lag 3 s HH_L0.1_stdStandard deviation of the auto-correlation of packet sizes at time lag 0.1 s H_L3_mean Mean of packet sizes at time lag 3 s H_L1_weight Weight of packet sizes at time lag 1 s HH_jit_L0.01_weightWeight of jitter on packet sizes at time lag 0.01 s HH_L5_covariance Covariance of the auto-correlation of packet sizes at time lag 5 s MI_dir_L3_weight Weight of the direct mutual information between features at time lag 3 s HpHp_L1_std Standard deviation of higher-layer protocol payload sizes in both directions at time lag 1 s HpHp_L0.1_stdStandard deviation of higher-layer protocol payload sizes in both directions at time lag 0.1 s HH_L0.01_covarianceCovariance of the auto-correlation of packet sizes at time lag 0.01 s HpHp_L0.1_covarianceCovariance of higher-layer protocol payload sizes in both directions at time lag 0.1 s H_L5_variance Variance of packet sizes at time lag 5 s H_L5_weight Weight of packet sizes at time lag 5 s HpHp_L3_magnitude Magnitude of higher-layer protocol payload sizes in both directions at time lag 3 s HpHp_L3_radius Radius of higher-layer protocol payload sizes in both directions at time lag 3 s MI_dir_L5_mean Mean of the direct mutual information between features at time lag 5 s HH_jit_L3_variance Variance of jitter on packet sizes at time lag 3 s HH_L0.01_pccPearson correlation coefficient between auto-correlation of packet sizes at time lag 0.01 s HpHp_L1_weight Weight of higher-layer protocol payload sizes in both directions at time lag 1 s HH_jit_L5_mean Mean of jitter on packet sizes at time lag 5 s HH_L5_pcc Pearson correlation coefficient between auto-correlation of packet sizes at time lag 5 s HpHp_L0.1_meanMean of higher-layer protocol payload sizes in both directions at time lag 0.1 s HpHp_L5_std Standard deviation of higher-layer protocol payload sizes in both directions at time lag 5 s HH_jit_L0.1_weightWeight of jitter on packet sizes at time lag 0.1 s HpHp_L1_pcc Pearson correlation coefficient between higher-layer protocol payload sizes at time lag 1 s HH_L0.1_magnitudeMagnitude of the auto-correlation of packet sizes at time lag 0.1 s HpHp_L1_radius Radius of higher-layer protocol payload sizes in both directions at time lag 1 s HpHp_L1_mean Mean of higher-layer protocol payload sizes in both directions at time lag 1 s HH_L0.1_covarianceCovariance of the auto-correlation of packet sizes at time lag 0.1 s HpHp_L5_weight Weight of higher-layer protocol payload sizes in both directions at time lag 5 s HH_L0.01_magnitudeMagnitude of the auto-correlation of packet sizes at time lag 0.01 s HH_jit_L1_weight Weight of jitter on packet sizes at time lag 1 s MI_dir_L0.01_meanMean of the direct mutual information between features at time lag 0.01 s HH_jit_L0.1_varianceVariance of jitter on packet sizes at time lag 0.1 s H_L0.01_weightWeight of packet sizes at time lag 0.01 s MI_dir_L1_weight Weight of the direct mutual information between features at time lag 1 s HpHp_L0.01_radiusRadius of higher-layer protocol payload sizes in both directions at time lag 0.01 s HH_L1_pcc Pearson correlation coefficient between auto-correlation of packet sizes at time lag 1 s H_L0.1_varianceVariance of packet sizes at time lag 0.1 s

Table 2 .
Representation of classes after LabelEncoder application.
Vol.:(0123456789) Scientific Reports | (2023) 13:21207 | https://doi.org/10.1038/s41598-023-48230-1www.nature.com/scientificreports/ µ σ Balanced accuracy (BACC)The BACC metric is an important indicator for assessing a model's classification performance, especially when the dataset shows class imbalances.In order to provide an overall evaluation of a model's correctness, it balances its sensitivity (True Positive Rate) and specificity (True Negative Rate) across all classes.The following equation is used for calculating the BACC:The other metric like the Error Rate is a metric that quantifies the proportion of misclassified instances in a classification model.Lower Error Rate indicates better model performance.Training Accuracy measures how well a model performs on the data it was trained on.High Training Accuracy suggests the model has learned from the training data.Testing Accuracy evaluates a model's performance on a separate, unseen dataset (the test set).High Testing Accuracy indicates the model's ability to generalize to new, unseen data.Through the Eqs.(11)to(13)provided below, error rate, training accuracy, and testing accuracy are all measured.

Table 3 .
values for various classes in the confusion matrix.

Table 4 .
Evaluation metrics for each class.

Table 5 .
Evaluation metrics (average) − 1 for various ensemble techniques checked in this research.
Vol:.(1234567890) Scientific Reports | (2023) 13:21207 | https://doi.org/10.1038/s41598-023-48230-1 of 0.0000, making it the most accurate at differentiating between botnets and regular instances.These results demonstrate the Extra Trees Ensemble method's superiority over the other methodologies under investigation and demonstrate its potential as a highly efficient and reliable solution for actual botnet detection scenarios.The average error rate, BACC, training accuracy, and testing accuracy for several ensemble approaches used in this botnet detection model are shown in Table

Table 8 .
Comparative performances of different models with different datasets.(*) means not mentioned.

Table 9 .
Models accuracy on different botnet datasets.

Table 10 .
Computational overhead of the model for different datasets.